Multi-task learning in under-resourced Dravidian languages
Authors
Abstract
It is challenging to obtain extensive annotated data for under-resourced languages, so we investigate whether it is beneficial to train models using multi-task learning. Sentiment analysis and offensive language identification share similar discourse properties. The selection of these tasks is motivated by the lack of large labelled user-generated code-mixed datasets. This paper works with YouTube comments in the Tamil, Malayalam, and Kannada languages. Our framework is applicable to other sequence classification problems irrespective of the size of the datasets. Experiments show that our multi-task learning model can achieve high results compared with single-task learning while reducing the time and space constraints required to train models on individual tasks. Analysis of the fine-tuned models indicates a preference for multi-task learning over single-task learning, resulting in a higher weighted F1 score on all three languages. We apply two multi-task learning approaches to three Dravidian languages: Kannada, Malayalam, and Tamil. Maximum scores on Malayalam were achieved by mBERT subjected to cross-entropy loss with an approach of hard parameter sharing. The best scores on Tamil were achieved by DistilBERT with soft parameter sharing as the architecture type. For sentiment analysis and offensive language identification, the best-performing model scored weighted F1-scores of (66.8%, 90.5%), (59%, 70%), and (62.1%, 75.3%) for Kannada, Malayalam, and Tamil, respectively.
Similar resources
Eigentrigraphemes for under-resourced languages
Grapheme-based modeling has an advantage over phone-based modeling in automatic speech recognition for under-resourced languages when a good dictionary is not available. Recently we proposed a new method for parameter estimation of context-dependent hidden Markov model (HMM) called eigentriphone modeling. Eigentriphone modeling outperforms conventional tied-state HMM by eliminating the quantiza...
Joint Bayesian Morphology learning for Dravidian languages
In this paper a methodology for learning the complex agglutinative morphology of some Indian languages using Adaptor Grammars and morphology rules is presented. Adaptor grammars are a compositional Bayesian framework for grammatical inference, where we define a morphological grammar for agglutinative languages and morphological boundaries are inferred from a plain text corpus. Once morphologica...
Acoustic modelling for under-resourced languages
Over the past decades research in the field of automatic speech recognition has led to systems with a sufficiently high grade of maturity that makes them suitable for use in real-life applications. However, such recognition systems have been developed only for very few languages. Languages addressed are mainly those with a large population, a high economic power, or for which a high political ...
Multi-Task Learning Using Mismatched Transcription for Under-Resourced Speech Recognition
It is challenging to obtain large amounts of native (matched) labels for audio in under-resourced languages. This could be due to a lack of literate speakers of the language or a lack of universally acknowledged orthography. One solution is to increase the amount of labeled data by using mismatched transcription, which employs transcribers who do not speak the language (in place of native speak...
Journal
Journal title: Journal of Data, Information and Management
Year: 2022
ISSN: 2524-6356, 2524-6364
DOI: https://doi.org/10.1007/s42488-022-00070-w